Assembling puzzles from preassembled blocks.

Author

  • P A Pevzner
Abstract

Assembling large jigsaw puzzles is difficult, and most of us haven’t even seen a 10,000-piece puzzle on sale in a toy store. Such puzzles require enormous dedication, and most children (not to mention adults) are not willing to put the time and effort into their assembly. Moreover, it is only feasible for multi-feature compositions like “Garden of Pleasures” by Hieronymus Bosch (one of the best-selling large puzzles) with hundreds of people and animals. When Celera assembled their first million-piece puzzle (Myers et al. 2000), the Drosophila melanogaster genome, the Public Human Genome Project did not have a program that would reliably assemble even thousand-piece puzzles without errors. Surprisingly enough, there is still no such program in the public domain today. Not to worry: Kent and Haussler (2001) “saved” the Human Genome Project with their GigAssembler. GigAssembler is very different from the Celera assembler: It assembles a million-piece puzzle (genome) from thousands of preassembled blocks (BAC contigs). Each such preassembled block may be composed of thousands of the original pieces (reads). The idea is simple: If you see a blue eye in one preassembled block from the “Garden of Pleasures”, then you are likely to find one more blue eye in another preassembled block. These two blocks should go together and help in the puzzle assembly. There are plenty of “pairs of eyes” in the genome: paired plasmid ends, BAC end pairs, parts of mRNAs or ESTs, and others. The difficulty, however, is (once again!) in repeats: What if there are many blue-eyed people (or animals) in the puzzle? Another problem is that assembly errors in preassembled blocks lead to complications. Unless such incorrectly assembled blocks are broken into correct parts, they may lead to errors in the final assembly.
The simple idea behind GigAssembler would not work for the traditional shotgun assembly because the “blue eyes” are not seen in the original small pieces (reads): Every blue eye is broken into many pieces. However, thanks to the hard work of Phrap, Consed (Gordon et al. 1998), and the army of finishers in many sequencing centers worldwide, the blocks have already been preassembled. As a result, instead of assembling millions of short 600-bp reads, Kent and Haussler assemble thousands of much larger (10,000 bp on average) blocks. This approach, of course, assumes that all (or most of) these blocks (BACs) are assembled correctly. Celera apparently was skeptical about the quality of BAC assemblies. Instead of using preassembled blocks, the Celera assembler shredded them into small pieces, mimicking the original sequencing reads (Venter et al. 2001). Although Celera reported that a significant number of BAC assemblies conflict with their shotgun data, GigAssembler seems to be able to successfully handle most such misassemblies. There are two surprising things about the GigAssembler algorithm: It is simple and it works. GigAssembler is a greedy algorithm that first assembles the pieces that best fit together and continues with progressively less well-fitting pieces. For example, if there are 10 blocks with blue eyes in the puzzle, it is not clear how to combine them into five pairs (not to mention that there are a number of one-eyed creatures in the “Garden of Pleasures”). However, if two of these 10 blocks also come with diamond earrings, there is strong evidence that these two pieces should go together. GigAssembler uses mRNA, ESTs, and other types of data as additional pieces of information about block pairing (diamond earrings), and estimates the fitness score for every two blocks of the puzzle. Afterwards, the blocks are assembled in a greedy fashion starting from the best-fit pairs.
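The greedy principle can be illustrated with a toy sketch. It is not GigAssembler's actual code: the function name `greedy_order`, the score dictionary, and the block names are all hypothetical, and the fitness scores are simply assumed to be given. The sketch sorts candidate pairs by score, accepts a pair only if neither of the two ends is already used, and rejects pairs that would close a cycle.

```python
def greedy_order(scores):
    """Greedily chain blocks together, best-fitting pairs first.

    scores maps (left_block, right_block) -> fitness score (hypothetical input).
    Returns a list of chains, each chain a list of block names in order.
    """
    parent = {}  # union-find forest used only to detect cycles

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    successor, predecessor = {}, {}
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    for (a, b), _score in ranked:
        if a in successor or b in predecessor:
            continue  # one of the two ends was already taken by a better pair
        root_a, root_b = find(a), find(b)
        if root_a == root_b:
            continue  # a and b are already in the same chain: would form a cycle
        parent[root_a] = root_b
        successor[a] = b
        predecessor[b] = a

    # Walk each chain from its first block (a block with no predecessor).
    blocks = {x for pair in scores for x in pair}
    chains = []
    for start in sorted(blocks - set(predecessor)):
        chain = [start]
        while chain[-1] in successor:
            chain.append(successor[chain[-1]])
        chains.append(chain)
    return chains


# Hypothetical scores: a higher value means stronger evidence that the
# first block is immediately followed by the second.
example_scores = {
    ("A", "B"): 5,
    ("B", "D"): 4,
    ("C", "B"): 3,  # rejected: B already has a predecessor
    ("D", "A"): 2,  # rejected: would close the cycle A-B-D-A
    ("C", "E"): 1,
}
print(greedy_order(example_scores))
```

In this toy input the weaker pairs lose to the stronger ones, which is exactly the behavior that makes greedy methods fragile when repeats produce many equally plausible pairs.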
Greedy algorithms, although simple, face many problems in the conventional shotgun fragment assembly: The shotgun reads may be too short to take advantage of the blue-eye principle and the repeats may be too numerous. Knowing that the greedy strategy does not work well for shotgun assembly, Kent and Haussler made a bet on the fact that preassembled blocks are long enough to make the blue-eye principle work. After implementing GigAssembler and assembling the public working draft of the human genome, they proved that it is indeed the case (International Human Genome Sequencing Consortium 2001). Although GigAssembler is based on a simple greedy principle, the implementation has to deal with many challenges. The distinguishing feature of GigAssembler is the variety of information it uses for assembly. To produce the working draft of the human genome, it had to deal with 400,000 sequence contigs from 30,000 large insert clones, process billions of bases of ESTs, etc. Instead of masking repeats, as done by the Celera assembler, GigAssembler tries to utilize them and faces the difficult problem of dealing with both accurate contig data and rather inaccurate EST and BAC end single-read data. Another problem is how to deal with the conflicts and how to position the contigs that do not overlap: What if the blue-eye principle suggests that a block A should go after a block B, while the brown-eye principle suggests that A should go after another block C? Fortunately, a similar problem has been addressed in the past for physical mapping (Thayer et al. 1999). In computer science it is known as constraint satisfaction, and Kent and Haussler take advantage of the classical Bellman-Ford algorithm to resolve conflicts in GigAssembler. The success of GigAssembler may influence the future decisions on how to proceed with other genomic sequencing projects.
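How Bellman-Ford relates to constraint satisfaction can be shown with the textbook reduction for difference constraints. The sketch below is not taken from GigAssembler: the function name, the block names, and the distances are hypothetical. It assumes each piece of evidence can be written as pos[j] - pos[i] <= w, treats every such constraint as a graph edge from i to j with weight w, and reports a conflict when relaxation never settles (a negative cycle).

```python
def solve_difference_constraints(variables, constraints):
    """Find positions satisfying constraints of the form pos[j] - pos[i] <= w.

    constraints is a list of (i, j, w) triples (hypothetical input format).
    Returns a dict of positions, or None if the constraints conflict.
    """
    # Starting every distance at 0 is equivalent to adding a virtual source
    # with a zero-weight edge to each variable.
    dist = {v: 0 for v in variables}
    for _ in range(len(variables)):
        changed = False
        for i, j, w in constraints:
            if dist[i] + w < dist[j]:
                dist[j] = dist[i] + w  # relax edge i -> j
                changed = True
        if not changed:
            return dist  # a full pass with no change: all constraints hold
    return None  # still changing after |V| passes: negative cycle, no solution


blocks = ["A", "B", "C"]
evidence = [
    ("B", "A", -100),  # B starts at least 100 bp after A: pos[A] - pos[B] <= -100
    ("C", "B", -50),   # C starts at least 50 bp after B:  pos[B] - pos[C] <= -50
    ("A", "C", 200),   # C starts at most 200 bp after A:  pos[C] - pos[A] <= 200
]
positions = solve_difference_constraints(blocks, evidence)
print(positions)

# Adding evidence that A comes after C contradicts the ordering above.
conflicting = evidence + [("A", "C", -1)]  # pos[C] - pos[A] <= -1
print(solve_difference_constraints(blocks, conflicting))
```

Only the differences between the returned positions are meaningful; shifting all of them by a constant gives an equally valid layout. A result of None signals that at least one piece of evidence has to be discarded before the blocks can be placed.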
The question of whether the future belongs to the shotgun approach or to the BAC-by-BAC approach or to a hybrid approach is still being debated. In the end it will boil down to economics: The least expensive and the most accurate approach will prevail. Unfortunately, it is not easy to predict the cost for each of these approaches, and one of the unknown variables is that we do not yet know the limits of the next generation of assembly algorithms. The debates about the optimal way to sequence genomes are not new. They started four years ago with two famous back-to-back Genome Research papers: “Human whole-genome shotgun sequencing” by James Weber and Gene Myers (Weber and Myers 1997) and “Against a whole-genome shotgun” by Phil Green (Green 1997). After Myers and colleagues published the algorithm behind the Celera assembler and assembled the Drosophila melanogaster genome…

FAX (858) 534-7029. Article and publication are at http://www.genome.org/cgi/doi/10.1101/gr.206301.

Similar resources

A Generalized Genetic Algorithm-Based Solver for Very Large Jigsaw Puzzles of Complex Types

In this paper we introduce new types of square-piece jigsaw puzzles, where in addition to the unknown location and orientation of each piece, a piece might also need to be flipped. These puzzles, which are associated with a number of real world problems, are considerably harder, from a computational standpoint. Specifically, we present a novel generalized genetic algorithm (GA)-based solver that...

Modular Design of Self-Assembling Peptide-Based Nanotubes.

An ability to design peptide-based nanotubes (PNTs) rationally with defined and mutable internal channels would advance understanding of peptide self-assembly, and present new biomaterials for nanotechnology and medicine. PNTs have been made from Fmoc dipeptides, cyclic peptides, and lock-washer helical bundles. Here we show that blunt-ended α-helical barrels, that is, preassembled bundles of α...

Pushing blocks is hard

We prove NP-hardness of a wide class of pushing-block puzzles similar to the classic Sokoban, generalizing several previous results [5, 6, 9, 10, 15, 17]. The puzzles consist of unit square blocks on an integer lattice; all blocks are movable. The robot may move horizontally and vertically in order to reach a specified goal position. The puzzle variants differ in the number of blocks that the r...

Building programmable jigsaw puzzles with RNA.

One challenge in supramolecular chemistry is the design of versatile, self-assembling building blocks to attain total control of arrangement of matter at a molecular level. We have achieved reliable prediction and design of the three-dimensional structure of artificial RNA building blocks to generate molecular jigsaw puzzle units called tectosquares. They can be programmed with control over the...

Push-2-F is PSPACE-complete

We prove PSPACE-completeness of a class of pushing-block puzzles similar to the classic Sokoban, extending several previous results [1, 5, 12]. The puzzles consist of unit square blocks on an integer lattice; some of the blocks are movable. The robot may move horizontally and vertically in order to reach a specified goal position. The puzzle variants differ in the number of blocks that the robot...

Journal title:
  • Genome Research

Volume 11, Issue 9

Publication date: 2001